Move original_max_position_embeddings to rope params
#42513
base: main
Conversation
…e TODOs from Joao
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…me we init rope compute fn
run-slow: phi3, phi, llama, mistral, mistral, qwen2_vl, deepseek_v3, qwen2, gemma2, gemma3

This comment contains models: ["models/deepseek_v3", "models/gemma2", "models/gemma3", "models/llama", "models/mistral", "models/phi", "models/phi3", "models/qwen2", "models/qwen2_vl"]

CI Results (Model CI Report): ❌ Failed tests
    original_max_position_embeddings (`int`, *optional*):
-       Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
+       Used with 'yarn', 'longrope' and 'llama3'. The original max position embeddings used during
        pretraining.
dynamic uses config.max_position_embeddings and doesn't require us to set an explicit original_max_position_embeddings in the rope dict.
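For illustration, two assumed example rope dicts (the field names follow the docstring above; the values are made up):

```python
# 'dynamic' scaling reads config.max_position_embeddings directly, so the rope dict
# does not need the pretrained length:
dynamic_rope = {"rope_type": "dynamic", "factor": 2.0}

# 'yarn', 'longrope' and 'llama3' do need it inside the dict:
yarn_rope = {"rope_type": "yarn", "factor": 2.0, "original_max_position_embeddings": 4096}
```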
def test_model_rope_scaling_frequencies(self):
    """Tests the frequency properties of the different RoPE scaling types on the model RoPE layer."""
    config, _ = self.model_tester.prepare_config_and_inputs_for_common()
    config.layer_types = ["full_attention", "sliding_attention"]
The below code sets rope params for two layer types, but the dummy config doesn't always get initialized with both. This line makes sure that the layer types are in line with the rope params.
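A rough sketch of the shape this sets up (the per-layer-type keying is an assumption made for illustration; only layer_types is taken from the test above):

```python
from types import SimpleNamespace

# Hypothetical dummy config, assuming rope params are keyed by the same names
# that appear in layer_types:
config = SimpleNamespace(
    layer_types=["full_attention", "sliding_attention"],
    rope_parameters={
        "full_attention": {"rope_type": "yarn", "factor": 2.0, "original_max_position_embeddings": 4096},
        "sliding_attention": {"rope_type": "default"},
    },
)
```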
rope_type = self.rope_type
original_inv_freq = self.original_inv_freq
prefix = ""
original_max_position_embeddings = self.config.rope_parameters["original_max_position_embeddings"]
Safe to assume it already exists: we move original_max_position_embeddings to its correct location at config init time.
[For maintainers] Suggested jobs to run (before merge): run-slow: gemma3
molbap left a comment
Saw the slow-PR CI was broken and thought why not give a quick review. Nice refactor, thanks!
if rope_parameters["rope_type"] in ["llama3", "yarn", "longrope"]:
    if hasattr(self, "original_max_position_embeddings"):
        # NOTE: Phi3 (and potentially other models) save `original_max_position_embeddings` field
        # containing the pretrained value outside rope parameters. This is an exception case where we
        # give priority to `self.original_max_position_embeddings`
        self.rope_parameters["original_max_position_embeddings"] = self.original_max_position_embeddings
    else:
        self.rope_parameters.setdefault("original_max_position_embeddings", self.max_position_embeddings)
Curious how we were handling this before? Not a big worry, as we were doing so much implicit stuff before.
Before, we would handle it in each compute_XXX_rope_freqs, and each had a slightly different way: Llama3 assumes that rope_parameters["original_max_position_embeddings"] always exists, YaRN would try a fallback with config.max_position_embeddings, and LongRoPE has a priority check for config.original_max_position_embeddings.
That is why we have a first condition here, which is the priority check from LongRoPE. A few models were saved inconsistently, causing us a headache now.
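For reference, a minimal sketch of that divergence (function names are simplified stand-ins, not the actual transformers helpers; `config` is any object with the attributes used below):

```python
def yarn_original_max_before(config):
    # YaRN used to fall back to the plain max_position_embeddings when the key was missing.
    return config.rope_parameters.get("original_max_position_embeddings", config.max_position_embeddings)


def longrope_original_max_before(config):
    # LongRoPE gave priority to a top-level attribute (Phi3-style checkpoints).
    if hasattr(config, "original_max_position_embeddings"):
        return config.original_max_position_embeddings
    return config.rope_parameters["original_max_position_embeddings"]


def llama3_original_max_before(config):
    # Llama3 simply assumed the key was always present in the rope dict.
    return config.rope_parameters["original_max_position_embeddings"]


def original_max_after(config):
    # After this PR the key is standardized into rope_parameters at config init time,
    # so every rope function reads it the same way.
    return config.rope_parameters["original_max_position_embeddings"]
```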
I can see the nightmare yes. Thanks for explaining, useful for future us too hehe
It's useful for vLLM too, which has been applying this patch manually for a while
| "post-yarn context length / pre-yarn context length = " | ||
| "config.max_position_embeddings / config.rope_parameters['original_max_position_embeddings'] = " | ||
| f"{implicit_factor}). Using the explicit factor ({factor}) in YaRN. This may cause unexpected " | ||
| "behaviour in model usage, please correct the 'original_max_position_embeddings' fields in the model config." |
Users could also unset factor, no?
We obtain it as rope_parameters["factor"], so users would need to explicitly set it to None in that case. IMO we'd better nudge users to be consistent and not rely on implicitly inferred factor values.
Yeah, that's fair. Functionally it'll be equivalent, but agreed, let's not encourage config-breaking, headache-inducing behaviours.
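For context, a minimal sketch of the consistency check being discussed (function name and signature are assumptions, not the actual transformers helper):

```python
import logging
import math

logger = logging.getLogger(__name__)


def resolve_yarn_factor(rope_parameters: dict, max_position_embeddings: int) -> float:
    """Prefer the explicit `factor`, but warn when it disagrees with the factor
    implied by the post-yarn / pre-yarn context lengths."""
    factor = rope_parameters["factor"]
    original_max = rope_parameters["original_max_position_embeddings"]
    implicit_factor = max_position_embeddings / original_max
    if not math.isclose(factor, implicit_factor):
        logger.warning(
            "Explicit YaRN factor (%s) differs from the implicit one (%s). Using the explicit "
            "factor; please correct 'original_max_position_embeddings' in the model config.",
            factor,
            implicit_factor,
        )
    return factor
```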
CI Results (Model CI Report): ❌ Failed tests

oke, I broke phi longrope apparently
What does this PR do?
As per title, resolves the TODO from Joao and moves the patching for original_max_position_embeddings inside the rope dict standardization. That way, original_max_position_embeddings is moved to the correct field once at init time and we can delete the similar patches from the individual rope functions. Note that this is not a breaking change; instead we move near-duplicate code to a single place.

cc @hmellor
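As an illustration of the intended end state (dict shapes and values are made-up examples, not a real checkpoint):

```python
# A Phi3-style checkpoint keeps the pretrained length at the top level of the config:
loaded = {
    "max_position_embeddings": 131072,
    "original_max_position_embeddings": 4096,
    "rope_parameters": {"rope_type": "longrope"},
}

# After config init (post-PR), the value lives inside the rope dict, so every
# rope compute function can read it from one canonical location:
standardized_rope_parameters = {
    "rope_type": "longrope",
    "original_max_position_embeddings": 4096,
}
```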